vttablet: handle applier metadata init failures in relay-log recovery #19560

Merged
mhamza15 merged 28 commits into vitessio:main from mhamza15:vtorc-handle-applier-metadata-failure
Apr 3, 2026

mhamza15 commented Mar 4, 2026

Collaborator

Description

handleRelayLogError currently retries replication restart for known recoverable metadata-init failures (relay log info and master info). MySQL can also return:

Replica failed to initialize applier metadata structure from the repository

This treats this error as the same recoverable class by triggering RestartReplication (STOP REPLICA, RESET REPLICA, START REPLICA).
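
As a rough illustration of the behavior described above, the following sketch classifies an error by its message text and calls a restart callback for the recoverable class. The names `isRecoverableInitError`, `handleInitError`, and the sample errors are hypothetical and are not the actual Vitess code; only the three message strings come from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Messages named in this PR. Matching on message text is an assumption
// about the approach; the real helper may classify errors differently.
var recoverableInitMessages = []string{
	"Replica failed to initialize relay log info structure from the repository",
	"Could not initialize master info structure",
	"Replica failed to initialize applier metadata structure from the repository",
}

// isRecoverableInitError reports whether err mentions one of the known
// metadata-init failures.
func isRecoverableInitError(err error) bool {
	if err == nil {
		return false
	}
	for _, msg := range recoverableInitMessages {
		if strings.Contains(err.Error(), msg) {
			return true
		}
	}
	return false
}

// handleInitError calls restart for recoverable errors and otherwise
// returns the original error unchanged.
func handleInitError(err error, restart func() error) error {
	if !isRecoverableInitError(err) {
		return err
	}
	return restart()
}

// Sample errors used for demonstration only.
var (
	errApplier   = errors.New("Replica failed to initialize applier metadata structure from the repository")
	errUnrelated = errors.New("connection refused")
)

func main() {
	restarts := 0
	restart := func() error { restarts++; return nil }

	fmt.Println(handleInitError(errApplier, restart), restarts)
	fmt.Println(handleInitError(errUnrelated, restart), restarts)
}
```

In this sketch the restart callback stands in for `RestartReplication`; an unrelated error never reaches it.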

Related Issue(s)

Fixes #19612

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

Help from Codex.

@mhamza15 mhamza15 self-assigned this Mar 4, 2026
@github-actions github-actions Bot added this to the v24.0.0 milestone Mar 4, 2026
codecov Bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 78.68852% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.96%. Comparing base (70c7a72) to head (37ce03c).
⚠️ Report is 142 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| go/vt/vttablet/tabletmanager/rpc_replication.go | 79.24% | 11 Missing ⚠️ |
| go/vt/vttablet/tabletmanager/restore.go | 0.00% | 1 Missing ⚠️ |
| go/vt/vttablet/tabletmanager/tm_init.go | 50.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (70c7a72) and HEAD (37ce03c). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (70c7a72) HEAD (37ce03c)
1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19560       +/-   ##
===========================================
- Coverage   69.67%   59.96%    -9.72%     
===========================================
  Files        1614      109     -1505     
  Lines      216793    17853   -198940     
===========================================
- Hits       151044    10705   -140339     
+ Misses      65749     7148    -58601     
Flag Coverage Δ
partial 59.96% <78.68%> (?)


@mhamza15 mhamza15 force-pushed the vtorc-handle-applier-metadata-failure branch from c82ddf5 to 721e63e Compare March 4, 2026 19:54
  // The same fix also works for https://github.com/vitessio/vitess/issues/10955.
- if strings.Contains(err.Error(), "Replica failed to initialize relay log info structure from the repository") ||
-     strings.Contains(err.Error(), "Could not initialize master info structure") {
+ if isRecoverableReplicationInitializationError(err) {

Member

I guess we don't have access to the MySQL error codes here? It feels quite brittle to check against the error message strings - but if that's the only thing we can do here it's fine.

Collaborator Author

Yeah I was thinking the same thing, but it gets flattened earlier: https://github.com/vitessio/vitess/blob/main/go/vt/mysqlctl/query.go?plain=1#L84. It's definitely doable and preferable, but I think it'd require a bit of a refactor that I'll leave as a follow-up.

Contributor

+1 that sqlerror.NewSQLErrorFromError sucks, but it's in other code 🤷

I just wanted to point out many RPCs map sqlerrors -> vterrors.Code. For example, we return vtrpcpb.Code_UNAVAILABLE if the error is of a class that probably means unavailable
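
To make the "error codes instead of strings" idea in this thread concrete, here is a minimal sketch that recovers the errno from an already-flattened message and classifies by code. It assumes the `ERROR <errno> (<sqlstate>): ...` layout quoted elsewhere in this review; the layout Vitess actually produces may differ, and `extractErrno`, `isRecoverableByErrno`, and the sample errors are hypothetical names. The codes 1201, 1871, and 1872 are the ones cited in this PR's review comments.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// errnoPattern matches a prefix such as "ERROR 1872 (HY000)".
// Assumption: the flattened error text carries the errno in this form.
var errnoPattern = regexp.MustCompile(`ERROR (\d+) \(\w+\)`)

// recoverableErrnos lists the metadata-init codes cited in the review.
var recoverableErrnos = map[int]bool{1201: true, 1871: true, 1872: true}

// extractErrno returns the MySQL errno embedded in err's message, if any.
func extractErrno(err error) (int, bool) {
	if err == nil {
		return 0, false
	}
	m := errnoPattern.FindStringSubmatch(err.Error())
	if m == nil {
		return 0, false
	}
	n, convErr := strconv.Atoi(m[1])
	if convErr != nil {
		return 0, false
	}
	return n, true
}

// isRecoverableByErrno classifies by error code rather than message text.
func isRecoverableByErrno(err error) bool {
	n, ok := extractErrno(err)
	return ok && recoverableErrnos[n]
}

// Sample errors used for demonstration only.
var (
	errWrapped = errors.New("ERROR 1872 (HY000): Replica failed to initialize applier metadata structure from the repository")
	errOther   = errors.New("ERROR 1045 (28000): Access denied")
	errNoCode  = errors.New("connection refused")
)

func main() {
	for _, err := range []error{errWrapped, errOther, errNoCode} {
		n, ok := extractErrno(err)
		fmt.Printf("errno=%d found=%v recoverable=%v\n", n, ok, isRecoverableByErrno(err))
	}
}
```

Classifying by code avoids depending on message wording, which MySQL has changed across versions, but it still parses text; carrying a typed SQL error through the call chain would be the cleaner fix the author defers to a follow-up.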

mattlord left a comment

Member

The idea is good, but a number of suggestions. Also some testing gaps identified by Claude:

  Testing Gaps

  4. Unit test doesn't verify substring matching with MySQL error prefixes

  TestHandleRelayLogError creates errors via errors.New(constantString), testing exact matches. In production, MySQL errors arrive wrapped with errno/sqlstate:

  ERROR 1872 (HY000): Replica failed to initialize relay log info structure from the repository
  ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log

  The strings.Contains logic handles this correctly, but the unit test doesn't prove it. Adding one test case with a realistic MySQL-wrapped error message would catch any future regression if someone changed Contains to e.g. HasPrefix or ==:

  {
      name:          "applier metadata error with MySQL prefix",
      inputErr:      errors.New("ERROR 1872 (HY000): Replica failed to initialize applier metadata structure from the repository"),
      shouldRestart: true,
  },

  5. Planned reparent integration tests don't cover masterInfoInitializationError

  The relayErrors table in TestPlannedReparentShardRelayLogError and TestPlannedReparentShardRelayLogErrorStartReplication covers "relay log info" and "applier metadata" but omits "master info". If you're converting to table-driven, including all three error variants is cheap and gives complete coverage. The reparent_utils_test.go does cover "master info", so cross-file coverage exists, but within a single test function the gap is inconsistent.

  6. TestHandleRelayLogError "unrelated error" case doesn't verify the error is returned unmodified

  The test checks require.ErrorIs(t, err, tc.inputErr) which is correct, but it doesn't verify that RestartReplication was NOT called (no CheckSuperQueryList verification for the non-restart path). The empty ExpectedExecuteSuperQueryList handles this implicitly since CheckSuperQueryList would fail if unexpected queries ran, but only if the test actually calls CheckSuperQueryList, which it does at the bottom. So this is fine, just subtle.
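
The regression risk described in item 4 above can be shown in isolation. This is a small sketch, not part of the PR's test suite: it compares three matching strategies against a bare message and against the same message with the errno/sqlstate prefix quoted in item 4. The function and constant names are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

const applierMsg = "Replica failed to initialize applier metadata structure from the repository"

// wrappedMsg mimics the prefixed form shown in the review comment.
const wrappedMsg = "ERROR 1872 (HY000): " + applierMsg

// Three candidate matching strategies.
func matchContains(s string) bool  { return strings.Contains(s, applierMsg) }
func matchHasPrefix(s string) bool { return strings.HasPrefix(s, applierMsg) }
func matchExact(s string) bool     { return s == applierMsg }

func main() {
	matchers := []struct {
		name  string
		match func(string) bool
	}{
		{"Contains", matchContains},
		{"HasPrefix", matchHasPrefix},
		{"==", matchExact},
	}
	for _, m := range matchers {
		fmt.Printf("%-9s bare=%v wrapped=%v\n", m.name, m.match(applierMsg), m.match(wrappedMsg))
	}
}
```

All three strategies accept the bare message, so a test that only uses the bare constant cannot tell them apart; only the prefixed input separates `Contains` from the other two.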

mhamza15 added 2 commits April 1, 2026 10:45
…path

The reparent branch that restarts replication after a no-op source check still called `StartReplication` inline and routed the error through `handleRecoverableReplicationInitError` manually. That duplicated the helper we already use everywhere else for the same start recovery behavior.

This switches that branch to call `startReplicationRecoverable` directly so the explicit START REPLICA recovery stays in one place.

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
The reparent restart branch still open-coded a recoverable `STOP REPLICA` path while start recovery had already been pulled into a helper. That left the stop handling inconsistent and made the restart path harder to read.

This adds `stopReplicationRecoverable`, uses it where the reparent restart flow already had identical recoverable-stop behavior, and adds a focused unit test for the new helper.

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4762624b01

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

`SetReplicationSource` source-change failures were still handled with restart-style recovery, which can resume the old source before the requested one is known to be stored. The same recovery shape was also awkward for `STOP REPLICA`, where a recoverable stop failure could return success with replication running.

This changes `setReplicationSourceRecoverable` to repair recoverable source-change errors by `ResetReplicationParameters`, reapply the requested source, and only start replication when requested. It also stops attempting recoverable handling for `STOP REPLICA`, updates the helper comments, and adds regression coverage for running and non-running replicas.

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Copilot AI review requested due to automatic review settings April 1, 2026 16:14
mhamza15 commented Apr 1, 2026

Collaborator Author

Made some changes worth noting:

  • We perform SetReplicationSource in separate steps and recover accordingly. Recoverable failures in the replication source step are repaired by running only RESET REPLICA ALL, so that we don't restart replication pointing at the old source, and then setting the new source again. Recoverable failures in the optional start replication step just restart replication as normal, which runs STOP REPLICA -> RESET REPLICA -> START REPLICA.
  • Removed recovery from failures when attempting to stop replication. Since recovery involves stopping replication again, we'd likely fail again. A proper recovery would be to run RESET REPLICA ALL, then re-apply the old source and leave replication stopped, but I'm electing to leave that as a potential follow-up rather than include it in this PR.

@mhamza15 mhamza15 requested review from mattlord and nickvanw April 1, 2026 16:19

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

mhamza15 added 2 commits April 1, 2026 12:24
MySQL requires the replica SQL and I/O threads to be stopped before `RESET REPLICA [ALL]`, but the new `SetReplicationSource` recovery path could reach `ResetReplicationParameters` immediately after a failed source-change attempt on a running replica. That left the recovery logic depending on an unstated assumption about the failed `SetReplicationSource(..., stopReplicationBefore=true)` call.

This now issues an explicit `StopReplication` before `ResetReplicationParameters` when the replica was running, and updates the running-replica recovery test to expect that extra stop.
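
The ordering described in this commit message can be sketched with a fake daemon that records calls. Everything here is a simplified assumption: `replicaDaemon`, `fakeDaemon`, `setSourceRecoverable`, and `runScenario` are hypothetical names, the real methods take a context and more arguments, and recovery of the final start step is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// replicaDaemon is a hypothetical, minimal stand-in for the MySQL daemon.
type replicaDaemon interface {
	SetSource(host string, port int) error
	Stop() error
	ResetParameters() error // stands in for RESET REPLICA ALL
	Start() error
}

var errRecoverable = errors.New("Replica failed to initialize applier metadata structure from the repository")

func isRecoverable(err error) bool { return errors.Is(err, errRecoverable) }

// setSourceRecoverable repairs a recoverable source-change failure by
// stopping (only if the replica was running), resetting, and reapplying
// the requested source. Replication starts only when start is true.
func setSourceRecoverable(d replicaDaemon, host string, port int, wasRunning, start bool) error {
	if err := d.SetSource(host, port); err != nil {
		if !isRecoverable(err) {
			return err
		}
		// The reset requires the replica threads to be stopped first.
		if wasRunning {
			if err := d.Stop(); err != nil {
				return err
			}
		}
		if err := d.ResetParameters(); err != nil {
			return err
		}
		if err := d.SetSource(host, port); err != nil {
			return err
		}
	}
	if !start {
		return nil
	}
	return d.Start()
}

// fakeDaemon records calls and fails the first SetSource when asked to.
type fakeDaemon struct {
	calls             []string
	failNextSetSource bool
}

func (f *fakeDaemon) SetSource(host string, port int) error {
	f.calls = append(f.calls, "SetSource")
	if f.failNextSetSource {
		f.failNextSetSource = false
		return errRecoverable
	}
	return nil
}
func (f *fakeDaemon) Stop() error            { f.calls = append(f.calls, "Stop"); return nil }
func (f *fakeDaemon) ResetParameters() error { f.calls = append(f.calls, "ResetParameters"); return nil }
func (f *fakeDaemon) Start() error           { f.calls = append(f.calls, "Start"); return nil }

// runScenario returns the recorded call order for one recovery run.
func runScenario(wasRunning, start bool) (string, error) {
	d := &fakeDaemon{failNextSetSource: true}
	err := setSourceRecoverable(d, "new-primary", 3306, wasRunning, start)
	return strings.Join(d.calls, " -> "), err
}

func main() {
	fmt.Println(runScenario(true, true))
	fmt.Println(runScenario(false, false))
}
```

The explicit stop is what removes the dependency on the failed source-change call having already stopped the replica threads.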

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
The replication errno alias block in `go/mysql/sqlerror/constants.go` kept the old aligned spacing on `ERInnodbReadOnly`, which fails `gofumpt`. This reapplies the formatter output so CI accepts the file.

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Copilot AI review requested due to automatic review settings April 1, 2026 16:28

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f8320ad1f

Comment on lines 961 to 963
  if err := tm.MysqlDaemon.StopReplication(ctx, tm.hookExtraEnv()); err != nil {
-     if err := tm.handleRelayLogError(ctx, err); err != nil {
-         return err
-     }
+     return err
  }

P1: Recover metadata-init errors when STOP REPLICA fails

In the status.SourceHost == host && status.SourcePort == port && shouldbeReplicating path, StopReplication errors are now returned directly, so recoverable metadata-init failures (1201/1871/1872) from STOP REPLICA no longer trigger the reset-and-restart self-heal. This branch is exercised when the source is already correct (e.g. no-op/planned reparent and force-start flows), so a transient/corrupt replication metadata condition that previously recovered can now abort reparent/replication setup instead of repairing itself.

Collaborator Author

This is intentional, see #19560 (comment)

`TestPlannedReparentShardRelayLogError` still expected `PlannedReparentShard` to succeed when `STOP REPLICA` returned a recoverable metadata-init error, but the tabletmanager change intentionally removed stop recovery. That made the wrangler test fail in CI even though the supported `START REPLICA` recovery path still works.

This changes the stop-error PRS test to expect `SetReplicationSource` to fail and keeps the start-error companion test as the success-path coverage for the recovery we still support.

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37ce03c297

return nil
}

return tm.startReplicationRecoverable(ctx)

P2: Avoid running postflight hook in SetReplicationSource path

When shouldStartReplication is true, this helper now starts replication via startReplicationRecoverable, which calls MysqlDaemon.StartReplication and therefore runs the postflight_start_slave hook. Previously, SetReplicationSource(..., startReplicationAfter=true) performed CHANGE REPLICATION SOURCE + START REPLICA without invoking that hook, so this change can make reparent/init flows fail in environments where the postflight hook is present and returns an error, even though source reconfiguration and SQL start succeeded. Please preserve the previous hook behavior for source-change starts (or make hook execution explicit/opt-in for this path).

Collaborator Author

This is certainly a behavior change, but I'm not sure it's a wrong change. I'd expect the hook to run in this case, but it may cause some unexpected behavior.
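
The behavior change discussed in this thread can be pictured with a generic sketch. It is not the Vitess hook mechanism: `startReplica`, `okStart`, `failingHook`, and `errHook` are hypothetical, and the only point illustrated is that a start path which runs a post-start hook can return an error even when the start itself succeeded.

```go
package main

import (
	"errors"
	"fmt"
)

var errHook = errors.New("hook exited non-zero")

// startReplica runs start and then an optional post-start hook.
// A hook failure is returned even though start already succeeded.
func startReplica(start func() error, postHook func() error) error {
	if err := start(); err != nil {
		return fmt.Errorf("start replica: %w", err)
	}
	if postHook == nil {
		return nil
	}
	if err := postHook(); err != nil {
		return fmt.Errorf("postflight hook: %w", err)
	}
	return nil
}

// Demonstration callbacks.
func okStart() error     { return nil }
func failingHook() error { return errHook }

func main() {
	// Without a hook, a successful start reports no error.
	fmt.Println(startReplica(okStart, nil))
	// With a failing hook, the same successful start now reports an error.
	fmt.Println(startReplica(okStart, failingHook))
}
```

Whether callers should treat a hook failure as fatal for a source change is the open question in the exchange above; the sketch only shows why the two paths can diverge.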

@mhamza15 mhamza15 merged commit 4775281 into vitessio:main Apr 3, 2026
110 checks passed
@mhamza15 mhamza15 deleted the vtorc-handle-applier-metadata-failure branch April 3, 2026 15:06
mhamza15 added a commit to mhamza15/vitess that referenced this pull request Apr 3, 2026
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
mhamza15 added a commit that referenced this pull request Apr 3, 2026
…ay-log recovery (#19560) (#19788)

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>
mhamza15 added a commit that referenced this pull request Apr 5, 2026
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
mhamza15 added a commit to planetscale/vitess that referenced this pull request Apr 9, 2026
mhamza15 pushed a commit that referenced this pull request Apr 14, 2026
timvaillancourt pushed a commit to timvaillancourt/vitess that referenced this pull request Apr 17, 2026
timvaillancourt pushed a commit to timvaillancourt/vitess that referenced this pull request May 12, 2026
…vitessio#19560)

Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Development

Successfully merging this pull request may close these issues.

Bug Report: replicas do not self-heal when applier metadata initialization fails

6 participants